Acquisition of Medical Terminology for Ukrainian from Parallel Corpora and Wikipedia

Authors

  • Thierry Hamon
  • Natalia Grabar
Abstract

The increasing availability of parallel bilingual corpora, and of automatic methods and tools for processing them, makes it possible to build linguistic and terminological resources for low-resourced languages. We propose to exploit various corpora available in several languages in order to build bilingual and trilingual terminologies. Typically, terminological information extracted in French and English is associated with the corresponding units in the Ukrainian corpus through multilingual transfer. Depending on the approach used, the precision of term extraction varies between 0.454 and 0.966, while the quality of the interlingual relations varies between 0.309 and 0.965. The resulting resource contains 4,588 medical terms in Ukrainian and their 34,267 relations with French and English terms.
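
As a rough illustration of the multilingual transfer mentioned in the abstract, the sketch below projects a term span from a French sentence onto its aligned Ukrainian sentence using word-alignment links. The function name `project_term`, the alignment format, the contiguity rule, and the toy sentence pair are all assumptions made for illustration; the abstract does not say which alignment tool or projection rules the authors used.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical alignment format: a set of (source index, target index) links.
Alignment = Set[Tuple[int, int]]


def project_term(
    span: Tuple[int, int],
    alignment: Alignment,
    target_tokens: List[str],
) -> Optional[str]:
    """Project a source term span (start inclusive, end exclusive) onto the target.

    Returns the target term as a string, or None when the span has no
    alignment links or when the linked target tokens are not contiguous.
    """
    start, end = span
    # Collect every target position linked to a token inside the source span.
    targets = sorted({t for s, t in alignment if start <= s < end})
    if not targets:
        return None
    # Assumed filter: accept only projections that form one unbroken span.
    if targets[-1] - targets[0] + 1 != len(targets):
        return None
    return " ".join(target_tokens[targets[0]:targets[-1] + 1])


# Toy example (invented for illustration, not taken from the paper's corpora).
fr_tokens = ["l'", "insuffisance", "cardiaque", "est", "fréquente"]
uk_tokens = ["серцева", "недостатність", "є", "поширеною"]
alignment = {(1, 1), (2, 0)}  # insuffisance -> недостатність, cardiaque -> серцева

projected = project_term((1, 3), alignment, uk_tokens)
unaligned = project_term((3, 4), alignment, uk_tokens)
print(projected)
```

Rejecting non-contiguous projections is one simple way to trade recall for precision; a real pipeline might instead keep such candidates and filter them later.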

Similar resources

SimpleScience: Lexical Simplification of Scientific Terminology

Lexical simplification of scientific terms represents a unique challenge due to the lack of standard parallel corpora and the fast rate at which vocabulary shifts along with research. We introduce SimpleScience, a lexical simplification approach for scientific terminology. We use word embeddings to extract simplification rules from a parallel corpus containing scientific publications and Wikipedi...

Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction

Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopedias such as Wikipedia as comparable corpora for bilingual terminology e...

Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Exploiting BabelNet for Multilingual Biomedical Synonym Expansion

Our challenge contribution for CLEF-ER consists in providing annotations for all three corpora of the challenge (Medline, EMEA, Patents) for the languages French and German. The objective of these experiments is to verify whether a general multilingual ontological resource such as BabelNet (http://babelnet.org) can be used to substantially enrich the terminology provided by the challenge organizer...

Constructing a Chinese–Japanese Parallel Corpus from Wikipedia

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese–Japanese. As comparable corpora are far more available, many studies have ...

Publication date: 2015